Introduction

Descriptive Analysis

The raw dataset contains 7,728,394 observations (rows) of 46 variables (columns).

After data preparation and cleaning, the dataset contains 7,546,771 observations (rows) of 59 variables (columns).

Severity Number of Accidents
least severe 66121
less severe 6010987
more severe 1272321
most severe 197342

The author defines severity as “the impact on traffic.” Low severity accidents would have a minimal effect on traffic whereas high severity accidents would have a significant impact on traffic.

We can observe that the majority of accidents that took place between 2016 and 2023 were categorized as “less severe,” accounting for 6,010,987 of the total 7,546,771 accidents.

Statistical Analysis

Correlation Analysis of Key Quantitative Features

The heatmap shows the correlation between quantitative features such as temperature, wind chill, visibility, precipitation, and severity. Temperature and wind chill were nearly perfectly correlated (\(r = 0.99\)), as expected. However, severity had only weak correlations with all other variables, suggesting that accident severity is influenced by additional factors beyond those measured here.

numeric_data <- acc %>%
  select(Severity, visibility, temperature, wind_chill, precipitation) %>%
  na.omit()

cor_matrix <- cor(numeric_data)
cor_melted <- melt(cor_matrix)

ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Correlation") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 4) +
  theme_minimal() +
  labs(title = "Correlation Heatmap of Numerical Features", x = "", y = "")

ANOVA on Accident Severity by Weather Condition

A one-way ANOVA was conducted to examine whether accident severity differs by weather condition. The results showed a statistically significant effect of weather on accident severity, \(F(4, 1,\!814,\!823) = 18,\!549\), \(p < .001\), indicating that the average severity of accidents varies across different weather conditions.

anova_data <- acc %>%
  filter(!is.na(weather) & !is.na(Severity)) %>%
  filter(weather %in% c("Clear", "Cloudy", "Rain", "Snow", "Fog"))

# Run ANOVA
anova_result <- aov(Severity ~ weather, data = anova_data)

# Output ANOVA table
summary(anova_result)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## weather           4  18624    4656   18549 <2e-16 ***
## Residuals   1814823 455533       0                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

T-Tests on Severity and Frequency for Holidays

Severity on Specific Holidays T-Test

A Welch two-sample t-test was conducted to compare accident severity on specific holidays versus other days. The results showed a statistically significant difference in severity scores, \(t(93,\!469) = 2.50\), \(p = .0125\). The average severity on non-holidays (\(M = 2.212\)) was slightly higher than on holidays (\(M = 2.208\)), with a 95% confidence interval for the difference in means ranging from 0.0009 to 0.0073.

Frequency on Specific Holidays T-Test

A Welch two-sample t-test was also conducted to examine differences in the average number of accidents per day on holidays versus non-holidays. The results were statistically significant, \(t(43.04) = 3.27\), \(p = .0021\). The mean number of accidents per day was higher on non-holidays (\(M = 2,\!947\)) compared to holidays (\(M = 2,\!173\)), with a 95% confidence interval for the difference in means ranging from 297 to 1,!250.

library(lubridate)
library(dplyr)
library(ggplot2)

# fixed date holidays
custom_holidays <- c("01-01", # New Year's Day
                     "07-04", # Independence Day
                     "12-25") # Christmas

# floating holidays
get_floating_holidays <- function(years) {
  holidays <- c()
  
  for (y in years) {
    # Thanksgiving: 4th Thursday in November
    thanksgiving <- as.Date(cut(as.Date(paste0(y, "-11-01")) + weeks(3), "week")) + 4
    while (weekdays(thanksgiving) != "Thursday") {
      thanksgiving <- thanksgiving + 1
    }
    
    # Memorial Day: last Monday in May
    memorial_day <- as.Date(paste0(y, "-05-31"))
    while (weekdays(memorial_day) != "Monday") {
      memorial_day <- memorial_day - 1
    }
    
    # Labor Day: first Monday in September
    labor_day <- as.Date(paste0(y, "-09-01"))
    while (weekdays(labor_day) != "Monday") {
      labor_day <- labor_day + 1
    }
    
    holidays <- c(holidays, thanksgiving, memorial_day, labor_day)
  }
  
  as.Date(holidays)
}

# full holiday list
years <- 2016:2023
floating_days <- get_floating_holidays(years)

fixed_days <- do.call(c, lapply(years, function(y) {
  as.Date(paste0(y, "-", custom_holidays))
}))

# Combine all holidays
specific_holidays <- sort(c(fixed_days, floating_days))

# Flag holidays in accident data
acc$holiday_specific <- acc$date_ %in% specific_holidays


# T-Test: Severity
t_test_severity <- t.test(Severity ~ holiday_specific, data = acc)
print("T-test on Severity (Specific Holidays):")
## [1] "T-test on Severity (Specific Holidays):"
print(t_test_severity)
## 
##  Welch Two Sample t-test
## 
## data:  Severity by holiday_specific
## t = 2.4975, df = 93469, p-value = 0.01251
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  0.0008838382 0.0073295171
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            2.212178            2.208071
# Frequency per day
acc_day <- acc %>%
  group_by(date_) %>%
  summarise(n_acc = n(), holiday_specific = any(holiday_specific)) %>%
  ungroup()

t_test_freq <- t.test(n_acc ~ holiday_specific, data = acc_day)
print("T-test on Frequency (Specific Holidays):")
## [1] "T-test on Frequency (Specific Holidays):"
print(t_test_freq)
## 
##  Welch Two Sample t-test
## 
## data:  n_acc by holiday_specific
## t = 3.2727, df = 43.041, p-value = 0.002105
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##   296.8096 1249.9020
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            2946.832            2173.476

Visualization: Severity on Holidays

Although the difference is small, the chart shows a slightly higher average severity for accidents on non-holidays compared to holidays. The mean severity was 2.212 on non-holidays and 2.208 on holidays. The corresponding Welch t-test (\(t(93,\!469) = 2.50\), \(p = .0125\)) confirms that this difference is statistically significant, although not practically large. This suggests that while there are fewer accidents on holidays, they are not necessarily more or less severe.

acc %>%
  group_by(holiday_specific) %>%
  summarise(mean_severity = mean(Severity)) %>%
  ggplot(aes(x = holiday_specific, y = mean_severity, fill = holiday_specific)) +
  geom_col() +
  labs(title = "Average Severity on Specific Holidays vs. Other Days", y = "Avg Severity", x = "Is Specific Holiday")

Visualization: Frequency of Accidents on Holidays

The bar chart clearly shows that the average number of accidents per day is significantly lower on specific holidays compared to non-holiday dates. On average, there were around 2,173 accidents per day on holidays versus 2,947 on non-holidays. This visual supports the results of the Welch two-sample t-test (\(t(43.04) = 3.27\), \(p = .0021\)), confirming that this difference is statistically significant. The lower volume on holidays may reflect reduced traffic due to time off from work and school.

acc_day %>%
  group_by(holiday_specific) %>%
  summarise(mean_accidents = mean(n_acc)) %>%
  ggplot(aes(x = holiday_specific, y = mean_accidents, fill = holiday_specific)) +
  geom_col() +
  labs(title = "Average Daily Accidents: Specific Holidays vs. Other Days", y = "Avg Accidents/Day", x = "Is Specific Holiday")